OPTIMALITY OF STATIONARY HALTING POLICIES
AND FINITE TERMINATION OF SUCCESSIVE APPROXIMATIONS

1.  Introduction 

Consider a discrete-time-parameter S-state finite-action branching
Markov decision chain. Attention centers here on halting (resp.,
stopping) policies, i.e., those for which the expected population size
at time N is zero for some N (resp., converges to zero as N
approaches infinity). The value of a policy is the expected infinite-
horizon income that it earns. The supremum of the values of the halting
(resp., stopping) policies is the optimal halting (resp., stopping)
value of the decision chain. In general these values are not the same.
An optimal halting (resp., stopping) policy is one having maximum
value in that class of policies.

Eaves and Veinott [1] have shown that if there is a stopping
policy and all rewards are finite, then there is a stationary optimal
stopping policy if and only if the optimal stopping value is finite.
Moreover, they have shown that, when initiated with the value of a
stopping policy, the iterates of successive approximations converge to
the optimal stopping value, and that this value is a fixed point of the
optimal return operator.

The purpose of this paper is to investigate the following additional
problems under the hypothesis that the rewards are all real (resp., real
or minus infinity) valued. When does there exist a halting stationary
optimal stopping (resp., halting) policy? When do the iterates of
successive approximations converge in finitely many steps assuming
initiation with the value of a stationary halting policy?

The motivation for studying these problems comes from a companion
paper with Veinott [2]. There we show that the problem of finding
a minimum-concave-cost flow in a single-source network can be reduced
to finding a stationary optimal halting policy in an associated branching
Markov decision chain.

In order for there to be a halting optimal stopping (resp., halting)
policy, there must be a halting policy. Section 3 is concerned with
finding such a policy. To describe the results, let the halting time
of a policy from a state s be the first time at which the expected
population size is zero starting from s, if there is such a time;
otherwise, let the halting time from s be infinity. Also let the
halting time of a policy be the largest of its halting times from
each state. The main result of Section 3 is that there is a stationary
policy that simultaneously minimizes the halting time from each state,
and each of the finite halting times of that policy is S or less.

The  proof  of  this  result  is  a constructive  combinatorial  algorithm 
for  finding  the  desired  policy  and  its  halting  times  from  each  state. 

One consequence of the above result is that there is a halting policy
if and only if there is a stationary halting policy. Moreover, that
is so if and only if the (stationary) policy found by the above
algorithm is halting, or what is the same thing, its halting time is S
or less. These results complement those of Rothblum [4, p. 74]
concerning the case where instead every policy is halting (he calls
halting policies nilpotent).

The main results are given in Section 4. There we characterize
the existence of halting stationary optimal stopping (resp., halting)
policies by the condition that successive approximations terminates
in finitely many steps. More precisely, suppose the rewards are real
(resp., real or minus infinity) valued and successive approximations
is initiated with the value of a stationary halting policy. Then
the N-th iterate of successive approximations is a fixed point of the
optimal return operator for some N if and only if that is so for
some N not exceeding the largest of the halting times of the stationary
halting policies; moreover, this occurs if and only if there exists
a halting stationary optimal stopping (resp., halting) policy.
Furthermore, when this is so, successive approximations terminates at the
N-th iteration with such a policy, and its value is the indicated
fixed point. Analogous results are also established where successive
approximations is replaced by a Gauss-Seidel version. The running
time of each of the above methods is proportional to the product of
the numbers of states and nonzero data elements, i.e., rewards and
transition probabilities.

2. Preliminaries.


Following Eaves and Veinott [1], Veinott [6], and Rothblum-Veinott
[5], a branching Markov decision chain will now be described. Consider
a population consisting of a finite set of individuals each of which
is observed at a sequence of points in time labeled 1, 2, ... . An
individual observed at a given time point is found to be in one of a
finite set of states labeled 1, 2, ..., S. If there is no individual
in any state, the population is said to have stopped. Each time an
individual is observed in state s, an action a is chosen from a
nonempty finite set $A_s$ of possible actions in state s and a reward
$-\infty \le r(s,a) < \infty$ is received. The expected number of individuals in
state t at time N+1 generated by each individual in state s at
time N, given that action a was chosen at time N and given the
states observed and actions taken at times 1, 2, ..., N-1, is assumed
to be a real nonnegative function p(t|s,a) depending only on t, s,
and a.

Let $\Delta = \times_{s=1}^{S} A_s$ be the set of all decisions and let a policy be
a sequence $\pi = (\delta_1, \delta_2, \ldots)$ of decisions. Using a policy $\pi$ means
that if an individual is observed in state s at time N, then $\delta_N^s$
is the action chosen at that time. Let $\delta^\infty$ denote the sequence
$(\delta, \delta, \ldots)$ and call it a stationary policy.

For any $\delta \in \Delta$, let $r_\delta$ be the S-vector whose s-th component is
$r(s,\delta^s)$ and let $P_\delta$ be the nonnegative $S \times S$ matrix whose st-th
element is $p(t|s,\delta^s)$. The elements of $P_\delta$ will be referred to as
transition rates. Let $P_\pi^N = P_{\delta_1} \cdots P_{\delta_N}$ where $\pi = (\delta_1, \delta_2, \ldots)$.

A state t is said to be accessible from state s in N steps using
policy $\pi$ if $(P_\pi^N)_{st} > 0$. A state is always accessible from itself in
zero steps. A state is called immediately accessible from another when
that is so in one step.

Define the S-vector $v_\pi$, called the value of $\pi$, by

$$v_\pi = \limsup_{k \to \infty} v_\pi^k = \limsup_{k \to \infty} \sum_{N=0}^{k} P_\pi^N r_{\delta_{N+1}}.$$

Call the s-th component of $v_\pi$ the value of $\pi$ at initial state s.

A policy $\pi$ is called transient if $\sum_{N=0}^{\infty} P_\pi^N$ converges, and in this
case

$$v_\pi = \sum_{N=0}^{\infty} P_\pi^N r_{\delta_{N+1}}$$

since the sum converges absolutely.

Call a policy $\pi$ halting if $P_\pi^N = 0$ for some $N > 0$, and stopping
if $P_\pi^N \to 0$ as $N \to \infty$. Of course halting policies are stopping. A
halting (resp., stopping) policy $\pi$ will be called optimal halting
(resp., stopping) if $v_\pi \ge v_\sigma$ for all halting (resp., stopping) policies
$\sigma$. In that event, $v_\pi$ is called the optimal halting (resp., stopping)
value of the system. The term "branching" refers to the fact that we
require the transition rates to be nonnegative only. If in addition
we assume that the sum of the transition rates in each row of $P_\delta$ is
one or less for each decision $\delta$, then we obtain an ordinary Markov
decision chain. If such is the case, then the conditional probability
that a subsystem enters the stopped state at time N+1, given that it
is observed in state s at time N and action a is chosen then,
is $1 - \sum_{t=1}^{S} p(t|s,a)$. Associated with each decision $\delta$ is a (directed)
graph $G_\delta$ whose nodes are the states 1, 2, ..., S and whose arcs
are the ordered pairs (s,t) such that $(P_\delta)_{st} > 0$. A graph G is
called circuitless if the nodes can be labeled 1, 2, ..., n so that
if (s,t) is an arc in G, then s < t. A stationary policy $\delta^\infty$
is halting if and only if the graph $G_\delta$ is circuitless.

3. Characterization of Halting Policies.

Rothblum [4, p. 74] refers to halting policies as "nilpotent"
policies. Define the halting time $h_\pi$ of a policy $\pi$ to be the
smallest integer $N > 0$ such that $P_\pi^N = 0$ if $\pi$ is halting, and
set $h_\pi = \infty$ otherwise. Note that if policy $\pi$ is used, then the
individuals are almost surely in the stopped state at time $h_\pi$, i.e.,
the population has stopped.

A decision $\delta$ is called halting if that is so for $\delta^\infty$. Denote
by $\Gamma$ the set of halting decisions and let $h = \max_{\delta \in \Gamma} h_\delta$ be called
the halting time of the system where $h_\delta = h_{\delta^\infty}$ and $h = \infty$ if $\Gamma = \emptyset$.

Define $h_{\pi t}$, the halting time of the policy $\pi$ from state t,
as the smallest integer N such that the t-th row of $P_\pi^N$ vanishes
if such an integer exists, and $h_{\pi t} = \infty$ otherwise. Thus $h_{\pi t}$ is
the smallest integer $N > 0$ such that the population stops in N steps
or less from t. Note that $h_{\pi t}$ is a "combinatorial" property of $\pi$
in the sense that it depends on the location but not the magnitude of
the positive elements of the $P_\delta$.

The halting set $\mathcal{H}$ is the set of states t such that $h_{\pi t} < \infty$
for some $\pi$. Hence $t \in \mathcal{H}$ if there exists a policy with finite halting
time from state t. The proof of the following result not only constructs
the set $\mathcal{H}$ but also exhibits a decision $\delta$ which satisfies $h_{\delta t} < \infty$
for each t in $\mathcal{H}$ where $h_{\delta t} = h_{\delta^\infty t}$. Moreover, $h_{\delta t}$ is computed for
each $t \in \mathcal{H}$.

Theorem 3.1. (Existence of Stationary Policies With Minimum Halting
Times)

There is a stationary policy which simultaneously minimizes the
halting times from each state. Moreover, the halting time of that policy
is S or less from each state in $\mathcal{H}$.


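Proof. Let $\mathcal{H}_k$ be the set of states from which there is a policy with
halting time k or less. Then $\mathcal{H}_0 = \emptyset$. Also $\mathcal{H}_k = \mathcal{H}_{k-1} \cup I_k$ for
$k \ge 1$ where $I_k$ is the set of states s not in $\mathcal{H}_{k-1}$ such that
for some action $\delta^s$, say, in $A_s$, $p(t|s,\delta^s) = 0$ for each $t \notin \mathcal{H}_{k-1}$.
Since the $I_k$ are disjoint, there is an integer $N \le S$ such that
$I_{N+1} = \emptyset$ and so $\mathcal{H}_N = \mathcal{H}_{N+1} = \cdots = \mathcal{H}$. For each $s \notin \mathcal{H}$ define
$\delta^s$ in $A_s$ arbitrarily. For each $s \in \mathcal{H}$, $h_{\delta s} = k$ where $s \in I_k$,
and from the construction this is the minimum halting time from s. Q.E.D.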
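The layered construction in the proof can be sketched in code. The following Python fragment is only an illustration under stated assumptions: states are indexed from 0 rather than 1, the hypothetical argument `p[s][a]` is a dictionary holding only the positive rates p(t|s,a), and the function name `min_halting_times` is not from the paper. It builds the sets $I_1, I_2, \ldots$ in turn and records, for each state reached, the layer index and one action that attains it.

```python
import math

def min_halting_times(p):
    """p[s][a] is a dict {t: rate} of the positive rates p(t|s,a) (assumed layout).

    Returns (h, delta): h[s] is the minimum halting time from state s
    (math.inf outside the halting set) and delta[s] is an action index
    attaining it (action 0 is an arbitrary choice outside the halting set).
    """
    S = len(p)
    h = [math.inf] * S
    delta = [0] * S
    halted = set()                       # plays the role of H_{k-1}
    for k in range(1, S + 1):
        layer = {}                       # plays the role of I_k
        for s in range(S):
            if s in halted:
                continue
            for a, succ in enumerate(p[s]):
                # action a qualifies if every positive rate leads into H_{k-1}
                if all(t in halted for t in succ):
                    layer[s] = a
                    break
        if not layer:                    # I_k empty: the halting set is complete
            break
        for s, a in layer.items():
            h[s], delta[s] = k, a
        halted.update(layer)
    return h, delta
```

This direct version rescans every action in every round; it is meant to mirror the proof, not to be the fastest possible implementation.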


Note that the number of operations required to obtain a decision
$\delta$ with minimum halting time is $O(S^3 z)$ where z is the average number
of actions in a state. Moreover, if $\mathcal{H}$ contains every state, then the
decision $\delta$ exhibited in the proof is halting. The proof implies that if
$\mathcal{H}$ does not contain every state, then for each $i \notin \mathcal{H}$ the i-th row
of $P_\pi^N$ contains a nonzero element for each policy $\pi$ and each
integer $N \ge 1$.

A matrix P is called nilpotent if $P^N = 0$ for some N. The
spectrum of a matrix is defined as the set of its eigenvalues. Let
L(G) be the number of nodes in a longest chain (i.e., a directed path
with no repetition of nodes) in the graph G. The following lemma
states several known (e.g., Rothblum [4, p. 74], Kato [3, pp. 22, 38])
characterizations of halting decisions.

Lemma 3.2. (Characterization of Halting Decisions)

If $\delta$ is a decision, the following are equivalent:

(a) $\delta$ is halting.

(b) $P_\delta$ is nilpotent.

(c) The spectrum of $P_\delta$ is {0}.

(d) $G_\delta$ is circuitless.

(e) $h_\delta = L(G_\delta)$.

(f) $h_\delta \le S$.

(g) $h_\delta < \infty$.
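Conditions (d) and (e) suggest a purely graph-based test. The sketch below is an assumption-laden illustration rather than a procedure from the paper: it takes a hypothetical matrix `P` given as a list of rows, uses 0-based state indices, and the name `halting_time` is invented here. A depth-first search detects a circuit and, when none exists, returns the number of nodes in a longest chain, which by (e) equals $h_\delta$.

```python
import math

def halting_time(P):
    """Return L(G) for the graph of the nonnegative square matrix P,
    or math.inf if that graph contains a circuit."""
    S = len(P)
    succ = [[t for t in range(S) if P[s][t] > 0] for s in range(S)]
    WHITE, GREY, BLACK = 0, 1, 2
    color = [WHITE] * S
    longest = [0] * S                  # nodes in a longest chain starting at s

    def visit(s):
        color[s] = GREY
        best = 1
        for t in succ[s]:
            if color[t] == GREY:       # arc back into the current path: circuit
                return False
            if color[t] == WHITE and not visit(t):
                return False
            best = max(best, 1 + longest[t])
        longest[s] = best
        color[s] = BLACK
        return True

    for s in range(S):
        if color[s] == WHITE and not visit(s):
            return math.inf
    return max(longest, default=0)
```

For example, the chain 0 -> 1 -> 2 has three nodes, and the cube of its matrix is the first power that vanishes.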

The next result is immediate from Theorem 3.1 and Lemma 3.2.

Theorem 3.3. (Existence of Halting Policies)

The following are equivalent:

(a) There exists a stationary halting policy.

(b) $P_\delta$ is nilpotent for some decision $\delta$.

(c) The spectrum of $P_\delta$ is {0} for some decision $\delta$.

(d) $G_\delta$ is circuitless for some decision $\delta$.

(e) $h_\delta = L(G_\delta)$ for some decision $\delta$.

(f) $h_\delta \le S$ for some decision $\delta$.

(g) $h \le S$.

(h) There exists a halting policy.

(i) $\mathcal{H}$ is the entire set of states {1, 2, ..., S}.

Remark. The computation of h is NP-complete since a solution can be
verified in polynomial time and the longest path problem (which is
NP-complete) can be transformed to the present problem in polynomial
time. To see this, let the states be nodes and let the actions
determine whether to stop or choose an arc leading to an adjacent node.
The graph of a halting decision $\delta$ is circuitless and $h_\delta$ is the
length (in nodes) of its longest path. Note that $h_\delta$ equals the number of
nodes if and only if there is a Hamiltonian path (i.e., a path
containing all the nodes) in the graph.


4. Stopping Optimality and Finite Termination of Successive
Approximations.

In this section the existence of a halting stationary optimal
halting (resp., stopping) policy will be characterized by the condition
that successive approximations terminates in finitely many steps.
First, define the optimal return operator $\mathcal{R}$ by $\mathcal{R}v = \max_{\delta \in \Delta} R_\delta v$,
where $R_\delta v = r_\delta + P_\delta v$. The method of successive approximations is
the repeated application of the optimal return operator, $\mathcal{R}^k v$ being
the k-th approximation using v as the initial approximation. When
$r_\delta \gg -\infty$ for all $\delta$, Eaves and Veinott [1] have shown that for every
stopping value $v^0$, $\mathcal{R}^k v^0 \to v^*$ where $v^*$ is the optimal stopping value, and
there is a stationary optimal stopping policy if and only if $v^*$
is finite. In our study of halting policies, the initial approximation
$v^0$ will be the value $v_\gamma$ of any halting decision $\gamma$ where we
require only that $-\infty \le r_\delta \ll \infty$. Then $v_\gamma$ is the unique $v \ll \infty$
satisfying the recursive system $v = R_\gamma v$. Recall also that a halting
$\gamma$ can be constructed as in the proof of Theorem 3.1, if one exists.

Lemma 4.1.

If $\gamma$ is a halting decision, then $\mathcal{R}^k v_\gamma$ is nondecreasing in
$k \ge 0$ and is the value of a halting policy for each $k \ge 0$. Also,
$\mathcal{R}^k v \ge R_\gamma^k v = v_\gamma$ for each $k \ge h_\gamma$ and v.

Proof.

Since $\gamma$ is halting, $\mathcal{R} v_\gamma \ge R_\gamma v_\gamma = v_\gamma$, so $\mathcal{R}^{k+1} v_\gamma \ge \mathcal{R}^k v_\gamma$
for $k \ge 0$. Also $\mathcal{R}^k v_\gamma$ is the value of a halting policy that uses
$\gamma$ in each period following period k. The final assertion is immediate
on noting that $P_\gamma^N = 0$ for $N \ge h_\gamma$.

Under the assumption that $r_\delta \gg -\infty$ for all $\delta$, Eaves and Veinott
[1] have characterized when a stationary optimal stopping policy exists.
By contrast, the next result characterizes when that policy can also
be taken to be halting. It asserts that such a policy exists if and
only if the method of successive approximations terminates in finitely
many steps when initiated with the value of any halting decision.

Theorem 4.2. (Existence of Stationary Optimal Halting Policies)

If there is a halting policy, then the following are equivalent.

(a) There is a stationary optimal halting policy.

(b) $\mathcal{R}^i v_\gamma$ is the optimal halting value for every $i \ge h$ (resp.,
some $i \ge 0$) and every (resp., some) halting decision $\gamma$.

(c) $\mathcal{R}^i v_\gamma$ is a fixed point of $\mathcal{R}$ for every $i \ge h$ (resp.,
some $i \ge 0$) and every (resp., some) halting decision $\gamma$.

If also $r_\delta \gg -\infty$ for all $\delta$, then the above conditions remain
equivalent on replacing "optimal halting" with "optimal stopping" and
inserting "halting" before "stationary" everywhere.

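Proof.

(a) => (b). By hypothesis, there is a stationary optimal
halting policy $\delta^\infty$, say. For each halting decision $\gamma$ and $i \ge h$
it follows from Lemma 4.1 that $v_\delta \ge \mathcal{R}^i v_\gamma \ge R_\delta^i v_\gamma = v_\delta$, so $v_\delta = \mathcal{R}^i v_\gamma$.

(b) => (c). By Lemma 4.1 and hypothesis,
$\mathcal{R}^i v_\gamma \ge \mathcal{R}(\mathcal{R}^i v_\gamma) = \mathcal{R}^{i+1} v_\gamma \ge \mathcal{R}^i v_\gamma$,
so $\mathcal{R}^i v_\gamma$ is a fixed point of $\mathcal{R}$.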

(c) => (a). By Lemma 4.1, $v^k \equiv \mathcal{R}^k v_\gamma$ is nondecreasing in $k \ge 0$,
and by hypothesis $v^k = v^i$ for $k \ge i$. Choose the decisions $\delta_0 = \gamma$,
$\delta_1, \delta_2, \ldots$ recursively so that $v^k = R_{\delta_k} v^{k-1}$ and $\delta_k^s = \delta_{k-1}^s$ if
$v_s^k = (R_{\delta_{k-1}} v^{k-1})_s$ for each state s and $k > 0$. We now show that $\delta_i$ is
halting. If not, the graph of $\delta_i$ contains a circuit f on the
set of states C, say. Let $m \ge 0$ be the smallest integer such that
$\delta_m^s = \delta_{m+1}^s = \cdots = \delta_i^s$ for each $s \in C$. Since $\delta_0 = \gamma$ is halting,
$m > 0$. Because $v^k \ge v^{k-1}$ (with $v^{-1} = v^0$),

(1)  $v_s^{k+1} \ge (R_{\delta_k} v^k)_s \ge (R_{\delta_k} v^{k-1})_s = v_s^k$

for each state s and $k \ge 0$.


Now we show by induction on k that for each $m-1 \le k \le i$, at
least one of the inequalities in (1) is strict for some $s \in C$
(depending on k). Since $\delta_m^s \ne \delta_{m-1}^s$ for some $s \in C$, the first
inequality in (1) is strict for that s by construction so the claim
holds for k = m-1. Suppose it holds for k-1 ($m-1 \le k-1 < i$) and
consider k. Then $v_t^k > v_t^{k-1}$ for some $t \in C$. Now by definition of
C, there is an $s \in C$ such that (s,t) is an arc of the graph of
$\delta_k$. Thus the second inequality in (1) is strict if $v_u^k > -\infty$ for each
u in the set $\mathcal{U}$ of states immediately accessible from s when using
$\delta_k$. Since $v^k \ge v^m$, it suffices to show $v_u^m > -\infty$ for each $u \in \mathcal{U}$.
We now show that this is the case.

Recall that $v_t^m > -\infty$ for some $t \in C$. Put $\delta = \delta_m$. Since
$\delta^s = \delta_k^s$ for each $s \in C$ and the graph $G_\delta$ contains a chain (in f) from t to s
having N, say, nodes, then $(P_\delta^N)_{tu} > 0$ for each $u \in \mathcal{U}$. Also because
$R_\delta v^m \ge v^m$ by (1), $R_\delta^N 0 + P_\delta^N v^m = R_\delta^N v^m \ge v^m$. Hence $v_u^m > -\infty$ for each
$u \in \mathcal{U}$ as claimed, so the second inequality in (1) is strict.

From the above, $v_s^{i+1} > v_s^i$ for some $s \in C$, which is a contradiction.
Thus, $\delta_i$ is halting. Also by (1) and $v^{i+1} = v^i$, $v^i = R_{\delta_i} v^i$ and so
$v^i = v_{\delta_i}$.

Hence as in Eaves and Veinott [1], $v^i \ge R_\delta v^i$ for all $\delta$, so
on iterating this inequality using the decisions in a policy $\pi$ we get
$v^i \ge v_\pi^N + P_\pi^N v^i$ for $N \ge 1$. Thus if $\pi$ is halting, $v^i \ge v_\pi$. Hence
$\delta_i^\infty$ is a stationary optimal halting policy, establishing (a).

To complete the proof, it remains to note that if $r_\delta \gg -\infty$ for all
$\delta$, "optimal halting" is replaced by "optimal stopping" and "halting" is
inserted before "stationary" everywhere in the above theorem and its
proof, and "$\pi$ is halting" is replaced by "$\pi$ is stopping" in the preceding
paragraph, then the proof is valid as is. (The finiteness of the $r_\delta$'s
is used only in the next to last sentence of the proof that (c) implies (a).)

Remark. One immediate consequence of the above result is that if $\delta^\infty$ is
a halting stationary optimal halting (resp., stopping) policy, then
$\mathcal{R}^{h_\delta} v_\gamma$ is the optimal halting (resp., stopping) value for each halting
decision $\gamma$.


Algorithm. The above proof justifies the following procedure for
determining whether or not a halting stationary optimal halting (resp.,
stopping) policy exists, and if so, producing such a policy. First,
use the construction of Theorem 3.1 to determine whether or not a
halting decision exists, and if so, find one, say $\gamma$. In the latter
event construct $\delta_0, \delta_1, \delta_2, \ldots$ inductively as in the proof of Theorem
4.2 until an $0 \le i \le S$ is found for which $\mathcal{R}^i v_\gamma$ is a fixed point
of $\mathcal{R}$. In that event $\delta_i^\infty$ is the desired halting stationary optimal
halting (resp., stopping) policy. If no halting decision $\gamma$ exists,
or if $\gamma$ is halting but $\mathcal{R}^i v_\gamma$ is not a fixed point for any
$0 \le i \le S$, then no halting stationary optimal halting (resp., stopping)
policy exists.
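The procedure can be sketched as follows. This is a minimal illustration under several assumptions that are not in the paper: states and actions are indexed from 0, the hypothetical inputs `r[s][a]` and `p[s][a]` hold the rewards and the positive rates (as a dictionary keyed by next state), the names `backup` and `successive_approximations` are invented here, and fixed points are detected by exact equality, which is only safe when the arithmetic is exact (for example with dyadic or rational data). Ties are broken in favor of the previous action, as in the proof of Theorem 4.2.

```python
import math

def backup(r, p, v, s, a):
    """(R_delta v)_s for action a in state s: r(s,a) + sum_t p(t|s,a) v_t."""
    return r[s][a] + sum(rate * v[t] for t, rate in p[s][a].items())

def successive_approximations(r, p, gamma):
    """gamma is a halting decision (one action index per state).

    Returns (i, v, delta) when the i-th iterate is a fixed point of the
    optimal return operator for some i <= S, and None otherwise.
    """
    S = len(r)
    # v_gamma = sum over N < S of P_gamma^N r_gamma, because P_gamma^S = 0.
    v = [r[s][gamma[s]] for s in range(S)]
    for _ in range(S - 1):
        v = [backup(r, p, v, s, gamma[s]) for s in range(S)]
    delta = list(gamma)
    for i in range(S + 1):
        new_v, new_delta = [], []
        for s in range(S):
            # start from the previous action so that ties keep it
            best_a, best = delta[s], backup(r, p, v, s, delta[s])
            for a in range(len(r[s])):
                val = backup(r, p, v, s, a)
                if val > best:
                    best_a, best = a, val
            new_v.append(best)
            new_delta.append(best_a)
        if new_v == v:                  # the i-th iterate is a fixed point
            return i, v, delta
        v, delta = new_v, new_delta
    return None
```

A return value of `None` corresponds to the negative conclusion of the algorithm: no iterate up to the S-th is a fixed point.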

Theorem 4.2 implies that if the $r_\delta$'s are all finite, then the
optimal halting and stopping values coincide. However, this need not
be the case when $r_{\delta s} = -\infty$ for some $\delta$ and s, as the following example
illustrates.

Example. The Optimal Halting and Stopping Values Need Not Coincide.

Suppose there is one state and two actions $\gamma$ and $\delta$ are available
therein. If $\gamma$ is chosen at time k, the population receives a reward
of $-\infty$ and stops. If $\delta$ is chosen at time k, the population receives
zero reward, remains in the state with probability 1/2, and stops with
probability 1/2. Both $v_\gamma = -\infty$ and $v_\delta = 0$ are fixed points of $\mathcal{R}$
where $\mathcal{R}v = \max(\tfrac{1}{2} v, -\infty)$. Also $v_\delta$ is the optimal stopping value
and $v_\gamma$ is the optimal halting value.


[Figure 1. Transition diagram for the example; arc label 1/2.]
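The two fixed points of the one-state example can be checked numerically. The short sketch below assumes IEEE floating-point arithmetic, in which multiplying minus infinity by 1/2 gives minus infinity; the function name `optimal_return` is introduced here only for illustration.

```python
import math

def optimal_return(v):
    """Optimal return operator of the one-state example:
    gamma earns -inf and stops; delta earns 0 and survives with rate 1/2."""
    return max(-math.inf, 0.0 + 0.5 * v)

v_gamma = -math.inf     # value of the halting decision gamma
v_delta = 0.0           # value of the stopping decision delta

print(optimal_return(v_gamma) == v_gamma)
print(optimal_return(v_delta) == v_delta)
```

Both comparisons hold, so successive approximations started at $v_\gamma$ never leaves minus infinity, while the optimal stopping value is 0.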

Recall z is the average number of actions in a given state. Note
that $S^2 z$ is the number of additions and approximately the number of
comparisons required in one application of the optimal return operator.
Thus the number of operations needed in S steps of successive
approximations is $O(S^3 z)$.

Incidentally not all decisions $\delta_0, \delta_1, \delta_2, \ldots$ chosen above are
necessarily halting even though the first and the last decisions are
halting.

Example. Halting and Nonhalting Decisions are Obtained During
Intermediate Steps.

Referring to Figure 2 below, consider a system with three states
labeled 1, 2, 3. There are two actions available in states 1 and 3
while state 2 has three actions. Let a, b, c denote respectively
the first, second, and third actions available in a state. Let the
rewards be r(1,a) = 0, r(1,b) = r(2,a) = r(2,b) = r(2,c) = r(3,a) =
r(3,b) = 1. Let the transition rates be p(1|2,c) = 1, p(2|3,b) =
p(3|2,b) = 1/2 and all others zero. Note that $\delta_0 = (a,a,a)$ is halting,
but $\delta_1 = (b,b,b)$ is not.
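A hand computation of the iterates for this example, assuming the initial decision is $\gamma = (a,a,a)$ and that ties keep the previous action as in the proof of Theorem 4.2, appears to give the following table. The entries are components for states 1, 2, 3.

```latex
\begin{array}{c|c|c}
k & v^k = \mathcal{R}^k v_\gamma & \delta_k \\ \hline
0 & (0,\ 1,\ 1)     & (a,a,a) \\
1 & (1,\ 3/2,\ 3/2) & (b,b,b) \\
2 & (1,\ 2,\ 7/4)   & (b,c,b) \\
3 & (1,\ 2,\ 2)     & (b,c,b) \\
4 & (1,\ 2,\ 2)     & (b,c,b)
\end{array}
```

Under these assumptions the third iterate is a fixed point, the decision (b,b,b) at k = 1 has the circuit between states 2 and 3, and the final decision (b,c,b) is halting because its only arcs are (3,2) and (2,1).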

The Gauss-Seidel Method of Successive Approximations.

In general the method of successive approximations can be improved
by updating the value $v_s^{k+1}$ immediately after it is computed. Define
$(T_{s\delta} v)_k = (R_\delta v)_s$ if k = s, and $(T_{s\delta} v)_k = v_k$ otherwise. Let
$T_\delta = T_{S\delta} \cdots T_{1\delta}$. Observe that $T_\delta v = T_\delta 0 + Q_\delta v$ where $Q_\delta = P_{S\delta} \cdots P_{1\delta}$
and $P_{s\delta}$ is formed by replacing the s-th row of the identity matrix by
the s-th row of $P_\delta$. Define the Gauss-Seidel optimal return operator
$\mathcal{T}$ by $\mathcal{T}v = \max_{\delta \in \Delta} T_\delta v$.
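One sweep of the Gauss-Seidel operator can be sketched as below. The sketch assumes 0-based states updated in increasing order, hypothetical inputs `r[s][a]` (rewards) and `p[s][a]` (dictionary of positive rates keyed by next state), and the invented name `gauss_seidel_sweep`. It also relies on the observation, which follows from the monotonicity of the component updates, that maximizing state by state with the freshly updated values yields the maximum over all decisions.

```python
def gauss_seidel_sweep(r, p, v):
    """Apply the Gauss-Seidel optimal return operator once.

    Each state is updated in turn and its new value is used immediately
    by the states that follow. Returns (new_v, delta), where delta[s] is
    a maximizing action index for state s.
    """
    w = list(v)                         # work on a copy of v
    delta = []
    for s in range(len(r)):
        vals = [r[s][a] + sum(rate * w[t] for t, rate in p[s][a].items())
                for a in range(len(r[s]))]
        best_a = max(range(len(vals)), key=vals.__getitem__)
        w[s] = vals[best_a]             # in-place update, visible to later states
        delta.append(best_a)
    return w, delta
```

On the three-state example above, started from the value (0, 1, 1) of (a,a,a), a single sweep already reaches (1, 2, 2), whereas the ordinary method needs three iterations.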

Our goal now is to establish analogs of Lemma 4.1 and Theorem 4.2
for the Gauss-Seidel optimal return operator. An elegant way to show
this is to note that both of the above results and their proofs are
valid as is when $\mathcal{R}$ and $R_\delta$ are replaced respectively by $\mathcal{T}$ and $T_\delta$.
However, there is one complication in this approach. It is that $\mathcal{T}^k v_\gamma$
is not the value of a halting policy in the sense defined here. However,
$\mathcal{T}^k v_\gamma$ is the value of a generalized halting policy in which the action
chosen at each time depends not only on the state occupied at that time
but also on the last state visited. More precisely, let $\delta_1, \ldots, \delta_k$
be decisions with $\mathcal{T}^k v_\gamma = T_{\delta_1} \cdots T_{\delta_k} v_\gamma$. Then $\mathcal{T}^k v_\gamma$ is the value
of the generalized halting policy in which one uses $\delta_1, \ldots, \delta_k$
consecutively, each for as long as the states visited decrease strictly.
Thus if $\delta_i$ is used at time n and the system is in states s and
t at times n and n+1 respectively, then one uses $\delta_i$ in period
n+1 if s > t and $\delta_{i+1}$ in that period otherwise.

We can avoid the use of these generalized halting policies by
proving a variant of the natural analog of Lemma 4.1. First, an upturn
in a sequence of N positive integers $s_1, s_2, \ldots, s_N$ is an integer
$1 \le n \le N$ such that $s_{n-1} \le s_n$ where $s_0 = 0$. Let $g_{\pi s}$ be the limit
as $N \to \infty$ of the maximum number of upturns in the first N states
visited starting with $s_1 = s$ and using $\pi$ among those sequences of
length N having positive probability. Let $g_\pi = \max_s g_{\pi s}$ be called
the upturn number of the policy $\pi$. Also let $g_\delta = g_{\delta^\infty}$ and $g_{\delta s} = g_{\delta^\infty s}$.

Given a stationary policy $\delta^\infty$, we can compute $g_\delta$ by considering
the graph $G_\delta$. Clearly $G_\delta$ is circuitless if and only if $g_\delta < \infty$,
since the existence of a circuit implies there is at least one arc
(i,j) in $G_\delta$, called an upturn arc, such that $i \le j$. If $G_\delta$ is
circuitless, then $g_\delta$ is one plus the number of upturn arcs in the
chain that has the maximum number of upturn arcs. This chain is easily
computed by any algorithm for finding a maximum cost chain when a value
of one is assigned to each upturn arc and all others are assigned value
zero. An analog of Lemma 4.1 is now presented.
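The maximum-cost-chain computation of the upturn number can be sketched with a depth-first search. Assumptions made for this illustration only: the hypothetical input `P` is a square matrix given as a list of rows, states are 0-based, and the name `upturn_number` is invented here. Arcs (s, t) with s < t are given cost one and all other arcs cost zero; self-loops cannot occur once the graph is known to be circuitless.

```python
import math

def upturn_number(P):
    """One plus the maximum number of upturn arcs (s, t), s < t, on a chain
    of the graph of P, or math.inf if that graph contains a circuit."""
    S = len(P)
    succ = [[t for t in range(S) if P[s][t] > 0] for s in range(S)]
    state = [0] * S      # 0 = unvisited, 1 = on the current path, 2 = finished
    best = [0] * S       # maximum number of upturn arcs on a chain leaving s

    def visit(s):
        state[s] = 1
        for t in succ[s]:
            if state[t] == 1 or (state[t] == 0 and not visit(t)):
                return False             # circuit detected
            best[s] = max(best[s], best[t] + (1 if s < t else 0))
        state[s] = 2
        return True

    for s in range(S):
        if state[s] == 0 and not visit(s):
            return math.inf
    return 1 + max(best, default=0)
```

For a strictly lower triangular matrix every arc goes from a higher to a lower index, so the result is 1; reversing the labels of a three-node chain gives 3.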

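Lemma 4.3.

If $\gamma$ is a halting decision, then $\mathcal{T}^k v_\gamma$ is nondecreasing in
$k \ge 0$. Also $\mathcal{R}^k v \le \mathcal{T}^k v \le \mathcal{R}^{Sk} v$ for all $k \ge 0$ and all v
such that $v \le \mathcal{R} v$. Moreover, for each v, $\mathcal{T}^k v \ge T_\gamma^k v$ for all
$k \ge 0$ and $T_\gamma^k v = v_\gamma$ for all $k \ge g_\gamma$.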

Proof.

Since $\gamma$ is halting, $\mathcal{T} v_\gamma \ge T_\gamma v_\gamma = v_\gamma$. Thus $\mathcal{T}^{k+1} v_\gamma \ge \mathcal{T}^k v_\gamma$
for all $k \ge 0$.

Let w be such that $w \le \mathcal{R} w$ and let $\delta$ be a decision such that
$\mathcal{R} w = R_\delta w$. Thus $T_{t\delta} w \ge w$ for all t. Iterating this inequality,
$w^s \equiv T_{s-1,\delta} \cdots T_{1\delta} w \ge w$ for s = 2, 3, ..., S. Set $w^1 = w$. For
s = 1, 2, ..., S we have $(\mathcal{R} w)_s = (R_\delta w)_s \le (R_\delta w^s)_s = (T_\delta w)_s \le (\mathcal{T} w)_s$,
where the first inequality follows from the monotonicity of $R_\delta$ and
the second equality follows from the definition of $T_\delta$. Thus $\mathcal{R} w \le \mathcal{T} w$
for all w such that $w \le \mathcal{R} w$. Hence since $v \le \mathcal{R} v$, we have by
induction that

(1)  $\mathcal{R}^k v \le \mathcal{T} \mathcal{R}^{k-1} v \le \mathcal{T}^k v$

where the first inequality follows from what was shown above on letting
$w = \mathcal{R}^{k-1} v$ and the second inequality follows from the monotonicity of
$\mathcal{T}$ together with the induction hypothesis.

From the definition of $\mathcal{R}$,

(2)  $w \le \mathcal{R} w$ implies $T_{s\alpha} w \le \mathcal{R} w$ for all s and $\alpha$.

Let y be such that $y \le \mathcal{R} y$ and let $\beta$ be a decision satisfying
$\mathcal{T} y = T_\beta y$. Then (2) implies $T_{1\beta} y \le \mathcal{R} y$. Applying $T_{s\beta}$ sequentially
for s = 2, 3, ..., S yields

(3)  $\mathcal{T} y = T_{S\beta} \cdots T_{1\beta} y \le \mathcal{R}^S y$

where the inequality follows by iteratively applying (2) for
j = 1, 2, ..., S on setting $w = \mathcal{R}^{j-1} y$. Moreover by induction

(4)  $\mathcal{T}^k v \le \mathcal{T} \mathcal{R}^{S(k-1)} v \le \mathcal{R}^{Sk} v$

where the first inequality is true from the induction hypothesis together
with the monotonicity of $\mathcal{T}$ and the second inequality follows from (3)
on letting $y = \mathcal{R}^{S(k-1)} v$. We conclude from (1) and (4) that $v \le \mathcal{R} v$
implies $\mathcal{R}^k v \le \mathcal{T}^k v \le \mathcal{R}^{Sk} v$ for all $k \ge 0$.

From the definition of $\mathcal{T}$, $\mathcal{T} w \ge T_\gamma w$ for all w. By induction,
$\mathcal{T}^k v \ge \mathcal{T} T_\gamma^{k-1} v \ge T_\gamma^k v$ where the first inequality follows from the
monotonicity of $\mathcal{T}$ together with the induction hypothesis and the second
inequality follows from what was shown above on setting $w = T_\gamma^{k-1} v$.
Thus $\mathcal{T}^k v \ge T_\gamma^k v$ for all k.

It remains to show $T_\gamma^k v = v_\gamma$ for $k \ge g_\gamma$. Call a state that is
immediately accessible from a state s when $\gamma$ is used a follower
of s. Let $J_k = \{s : g_{\gamma s} \le k\}$ and $L_w = \{s : w_s = (v_\gamma)_s\}$ where w is an
S-vector. If the followers of s are in $L_w$, then

(5)  $(T_{s\gamma} w)_s = r_{\gamma s} + (P_\gamma w)_s = r_{\gamma s} + (P_\gamma v_\gamma)_s = (v_\gamma)_s$.

It is sufficient to show

(6)  $J_k \subseteq L_{T_\gamma^k v}$ for k = 0, 1, ...

since $J_k = \{1, 2, \ldots, S\}$ for $k \ge g_\gamma$. For k = 0, (6) becomes
$\emptyset \subseteq L_v$ which is trivially true. By induction, assume (6) holds for
k-1 and consider k. Let $w^j = T_{j\gamma} \cdots T_{1\gamma} T_\gamma^{k-1} v$ for $1 \le j \le S$ and
$w^0 = T_\gamma^{k-1} v$. For simplicity of notation, let $L_j = L_{w^j}$. It is sufficient
to show that

(7)  $J_k \cap \{1, 2, \ldots, j\} \subseteq L_j$ for j = 0, ..., S

since $w^S = T_\gamma^k v$ and, when j = S, (7) becomes (6). Now (7) is trivial
for j = 0. Suppose it holds for j-1 and consider j. Since $T_{j\gamma}$
changes only the j-th component of any vector,

(8)  $L_{j-1} \subseteq L_j \cup \{j\}$.

Thus if j is not in $J_k$, then by the induction hypothesis and (8)
$J_k \cap \{1, \ldots, j\} \subseteq J_k \cap \{1, \ldots, j-1\} \subseteq L_j$. If j is in $J_k$, then
the set $B_j$ of followers of j that are in $H_j = \{j+1, j+2, \ldots, S\}$
must have upturn number at most k-1 and the set $E_j$ of followers of
j that are in $\{1, 2, \ldots, j-1\}$ must have upturn number at most k.
From (6) for k-1, $B_j \subseteq J_{k-1} \cap H_j \subseteq L_0 \cap H_j \subseteq L_{j-1}$ where the third
inclusion follows from the fact that the operators $T_{p\gamma}$ for $1 \le p < j$
do not change the components of any vector with indices in $H_j$. From
(7) for j-1, $E_j \subseteq L_{j-1}$. Thus the followers of j are in $L_{j-1}$ and
from (5) $w_j^j = (T_{j\gamma} w^{j-1})_j = (v_\gamma)_j$. Hence $j \in L_j$ and thus by (7) for
j-1 and (8), (7) holds for j which completes the proof.

Remark. Lemma 4.3 implies that the Gauss-Seidel method converges at
least as fast as the usual method of successive approximations, i.e.,
each Gauss-Seidel iterate is always at least as large as the
corresponding iterate of the usual method.

Now using the above lemma we obtain the analog of Theorem 4.2.

Theorem 4.4.

Theorem 4.2 remains valid if $\mathcal{R}$ is replaced by $\mathcal{T}$ and h by g.

Proof.

The proof is identical to that of Theorem 4.2 where $\mathcal{R}$, $R_\gamma$, $P_\gamma$,
and h are replaced by $\mathcal{T}$, $T_\gamma$, $Q_\gamma$, and g respectively and Lemma 4.3
is used instead of Lemma 4.1 with only one exception. The exception
is that the above replacements are not made in the last paragraph of
the proof that (c) implies (a). Q.E.D.

Remark 1. The remark and algorithm following Theorem 4.2 also remain
valid with the obvious replacements.

Remark 2. The preceding theorem implies that if the states are labeled
such that $P_\delta$ is lower triangular with zero diagonal elements for
some optimal halting (resp., stopping) policy $\delta^\infty$, then $\mathcal{T} v_\gamma = v_\delta$
for every halting decision $\gamma$ since $g_\delta = 1$. The number of
operations is thus reduced by a factor of g.

In  some  cases  it  is  possible  to  determine  whether  there  exists  a 
halting  optimal  stopping  policy  without  computation.  The  following 
theorem  gives  sufficient  conditions  for  the  existence  of  a halting 
stationary  optimal  stopping  policy. 

Theorem 4.5.

Assume there exists a halting policy and $-\infty \ll r_\delta \le 0$ for
each decision $\delta$. If for each decision $\delta$ and each circuit C in
$G_\delta$, the product of the transition rates around C is one or more and
$r_{\delta t} < 0$ for some $t \in C$, then there exists a halting stationary
optimal stopping policy.

Proof.

By hypothesis, there exists a halting policy and hence a stopping
policy. Also $0 \ge \mathcal{R} 0$, so by a result of Eaves and Veinott [1],
there exists a stationary optimal stopping policy $\delta^\infty$, $v_\delta \le 0$, and
$\mathcal{R} v_\delta = v_\delta$. Suppose $G_\delta$ has a circuit consisting of the nodes
{1, 2, ..., m} with $r_{\delta 1} < 0$, say. Then $v_{\delta 1} < p_1 \cdots p_m v_{\delta 1} \le v_{\delta 1}$
where $p_k = p(k+1|k,\delta^k)$ for k < m and $p_m = p(1|m,\delta^m)$. This is a
contradiction, so $\delta$ is halting and the theorem follows.
improvement) terminates at the N-th iteration with such a policy,
and its value is the indicated fixed point. A combinatorial algorithm
for finding a stationary halting policy or showing one does not exist
is given. The running time of each of the above algorithms is
proportional to the product of the numbers of states and nonzero transition
probabilities. The results are applied in a companion paper with Veinott
to find minimum-concave-cost flows in single-source networks.